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ABSTRACT: 

We show that a color algorithm capable of separating illumination from 
reflectance in a Mondrian world can be learned from a set of examples. 
The learned algorithm is equivalent to filtering the image data - in which 
reflectance and illumination are intermixed - through a center-surround re- 
ceptive field in individual chromatic channels. The operation resembles the 
"retinex" algorithm recently proposed by Edwin Land. This result is a spe- 
cific instance of our earlier result that a standard regularization algorithm 
can be learned from examples. It illustrates that the natural constraints 
needed to solve a problem in inverse optics can be extracted directly from 
a sufficient set of input data and the corresponding solutions. The learning 
procedure has been implemented as a parallel algorithm on the Connection 
Machine System. 
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Computational vision has derived effective solutions to early vision prob- 
lems such as edge detection, stereopsis and structure from motion by exploit- 
ing general constraints on the imaging process and the natural world. An 
important question is: how does a visual system learn the algorithms it uses? 
Can they - and the underlying natural constraints - be learned automatically 
from sets of examples? We have explored these questions for the computa- 
tional problem of color vision, and implemented a parallel learning procedure 
on the Connection Machine System. 

Color constancy points to a difficult computation underlying human 
color vision. We do not merely discriminate between different wavelengths 
of light; we assign roughly constant colors to objects even though the inten- 
sity signals they send to our eyes change as the illumination varies across 
space and chromatic spectrum. Perfect color constancy would result from 
a computation that extracts the invariant spectral reflectance properties of 
surfaces from the image intensity signal, in which reflectance and illumina- 
tion are mixed. The fact that the colors we see are not exactly invariant 
suggests that our visual system performs a computation with a similar goal, 
but less exact results. The computation is typical of the difficult problems of 
inverse optics, in which the information supplied by a two-dimensional image 
is insufficient by itself to solve for a unique three-dimensional scene. Natural 
constraints must be found and applied to the problem in order to solve it. 

"Retinex" lightness algorithms, pioneered by Land (Land, 1959, 1985, 
1986; Land and McCann, 1971) and explored by others (Horn, 1974; Blake, 
1985; Hurlbert, 1986) illustrate one successful approach to the computation. 
The retinex algorithms assume that the visual system performs the same 
computation in each of three independent chromatic channels. The algo- 
rithms further assume that in each channel, the image intensity signal, or 
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more precisely, the image irradiance, s' , is proportional to the product of 
the illumination intensity e' and the surface spectral reflectance r' in that 
channel: 

s'(x, y) = r'(x, y)e'(x, y). (1) 

This form of the intensity equation makes the implicit assumption that the 
irradiance s' has no specular components and that the color channels have 
been chosen appropriately. 1 

Retinex algorithms seek to solve Equation (1) for lightness, which is an 
approximation to r'(x, y), in each channel. The resulting triplet of lightnesses 
labels a constant color in color space. To make a solution possible, retinex 
algorithms restrict their domain to a world of Mondrians, two-dimensional 
surfaces covered with patches of random colors (see Figure 1). The algo- 
rithms then make the explicit assumptions that (1) r'(x, y) is uniform within 
patches but has sharp discontinuities at edges between patches and that (2) 
e'(x,y) varies smoothly across the Mondrian. 

Most retinex algorithms first take the logarithm of both sides of Equa- 
tion (1), converting it to a sum: 

s(x,y) = r(x,y) + e(x,y), (2) 

where s = log s', r = log r' and e = log e'. The two assumptions are 
then exploited to break down the sum into its two components. 

The most recent retinex algorithm (Land, 1986) employs an operator 
that divides the image irradiance of Equation (1) at each location by a 
weighted average of the irradiance at all locations in a large surround. The 



For a detailed derivation of the intensity equation, see Appendix 1. 




Figure 1. A Mondrian under an illumination gradient, generated by adding together 
two 320x320 images: one is the (log) reflectance image, an array of rectangles 
each with a different, uniform grey-level (see Figure 2(a)); the other is the (log) 
illumination image, in which the pixel values increase linearly in the same way 
across each row. 

log of the operator's result is called lightness. The triplets calculated by 
the algorithm fall close to the colors humans see when viewing a Mondrian 
under illuminants with strong spatial gradients. The form of this operator is 
similar to that derived in our earlier formal analysis of the lightness problem 
(see Hurlbert, 1986). The main difference between the two is that the ana- 
lytically derived operator takes the log of irradiance before averaging, and so 
is linear in the logs, whereas Land's algorithm averages before taking the log, 
and so is not linear in the logs. As discussed below, the numerical difference 
between the two results is small. 

We set out to see whether a simple algorithm could learn from examples 
how to extract reflectance from image irradiance, and whether what it would 



learn would resemble one of the above algorithms. The examples we use are 
pairs of images: an input image of a Mondrian under illumination that varies 
smoothly across space and an output array that displays the reflectance of the 
Mondrian separately from the illumination. We then synthesize an operator 
from the examples by finding the linear estimator that best maps the input 
into its two output components, using optimal linear estimation techniques. 

For computational convenience we train the operator on one- dimensional 
vectors that represent horizontal scan lines across the Mondrian images (see 
Figure 2). We generate many different input vectors s by adding together 
different random r and e vectors, according to Equation (2). Each vector r 
represents a pattern of step changes across space, corresponding to one row 
of the output reflectance image (see Figure 3a). Each vector e represents a 
smooth gradient across space with a random offset and slope, corresponding 
to one row of the output illumination image. (In our implementation, we 
appended r to e to create an output vector twice as long as the input.) We 
then arrange the "training vectors" s and r as the columns of two matrices 
S and R, respectively. Our goal is then to compute the optimal solution L 
of 

LS = R, (3) 

where L is a linear operator represented as a matrix. 

It is well known that the solution of this equation that is optimal in the 
least squares sense is 

L = RS+, (4) 

where S + is the Moore-Penrose pseudoinverse (see, for example, Albert 
1972). We compute L using the technique of regularizing the pseudoinverse to 
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Figure 2. (a) and (b) A pair of one-dimensional examples like those used to train the 
algorithm, (a) shows the input data, which is a random Mondrian reflectance pat- 
tern superimposed on a linear illumination gradient with a random slope and offset. 
Each example can be thought of as a horizontal scan line across a Mondrian such 
as the one in Figure 1 (which was generated by stacking similar one-dimensional 
examples). Each example, 320 pixels in length, has a different reflectance pattern 
and a different linear illumination gradient, (b) shows the corresponding output 
solution, in which the illumination and the reflectance have been separated and 
concatenated. We used 1500 such pairs of input-output examples to train the op- 
erator shown in Figure 4. (c) shows the result obtained by the trained operator 
when it acts on the input data (a), not part of the training set. It should be com- 
pared with (b). This result is fairly typical: in some cases the prediction is even 
better, in others it is worse. 



obviate numerical stability problems (see Appendix 3). The number of exam- 
ples we use is significantly larger than the number of elements in each vector 
in S, in order to overconstrain the problem. Once L is computed it is tested 
on new scan lines, generated in the same way. The pseudoin verse may also 
be computed by recursive techniques which improve its form as more data 
become available (see Appendix 3). The latter procedure, although equiva- 
lent to our "one-shot" computing technique, may seem intuitively more like 
learning. 

We find that the trained operator L, given a new 5 as input, recovers 
a good approximation to the correct output vectors r and e. Operating 
on a two-dimensional Mondrian, generated by stacking appropriately many 
one-dimensional s vectors, L also yields a satisfactory approximation to the 
correct output image (see Figure 3b). 

It seems that our scheme has successfully learned an algorithm that per- 
forms the color computation correctly in a Mondrian world. What algorithm 
has been learned? What is its relationship to the filters described above? 
To answer these questions we examine the structure of the matrix L. We 
expect that because the operator should perform the same action on each 
point in the image, i.e. that it should be space-invariant, the central part 
of L should be a convolution matrix, in which each row is the same as the 
row above but displaced by one element to the right. In the peripheral parts 
of the matrix, this form will be corrupted by boundary effects. Inspection 
of the matrix and appropriate averaging of the relevant rows (see Figure 4) 
confirm these expectations. Like Land's psychophysical^ tested filter, it has 
a narrow positive peak and a broad, shallow, negative surround (see Figure 
4) that extends beyond the range we can observe, but not over the entire 
image. 




Figure 3. (a) The (log) reflectance image that is one component of the Mondrian 
of Figure 1. This image represents the Mondrian of Figure 1 under uniform illumi- 
nation, (b) The (log) reflectance image that the trained operator produces when 
it acts on the Mondrian of Figure 1. The operator has been trained on a set of 
one-dimensional examples different from those used to generate the Mondrian of 
Figure 1. 



Our algorithm is not exactly identical with Land's: the filter of Figure 
4 subtracts from the value at each point the average value of the logs at 
all points in the field, rather than the log of the average values. As men- 
tioned above, the difference between the outputs of the two filters is small in 
most cases, and both agree well with psychophysical results (Land, personal 
communication). We have explicitly compared the performance of Land's 
scheme on the Mondrian of Figure 1 with a scheme equivalent to our learned 
algorithm (filtering the log of the image through a filter like that shown in 
Figure 4). The resulting arrays approximate the correct outputs equally well, 
and are very similar to each other. 
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Figure 4. The learned filter, which extracts reflectance from a Mondrian image 
under a linear illumination gradient. The operator that is learned is a matrix that 
acts on one-dimensional vectors. Assuming that the operator is space invariant 
when boundary effects can be ignored, we estimate the shape of the corresponding 
filter by summating the central rows of the matrix, shifting them appropriately. 
The one-dimensional "receptive field" that results has a sharp positive peak and a 
shallow surround that extends beyond the range we can estimate reliably, which is 
the range we show here. The filter shown here was learned from a set of examples 
with linear illumination gradients (see Figure 2). When logarithmic illumination 
gradients are used, a qualitatively similar receptive field is obtained. In a separate 
experiment we used a training set of one-dimensional Mondrians with either lin- 
ear illumination gradients or slowly varying sinusoidal illumination with random 
wavelength, phase and amplitude. The resulting filter is shown in the inset. The 
inhibitory surround clearly decays back to zero. 
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To investigate the way in which the shape of the learned operator varies 
with the type of illumination gradient on which it is trained, we constructed 
a second set of examples. In addition to vectors with random linear illumi- 
nation gradients, this set contained an equal number of vectors with random 
sinusoidally varying illumination components. The operator trained on this 
mixture of illumination types differs from the operator trained on strictly 
linear ones in the shape of its surround. Whereas the latter has a broad 
negative surround that remains virtually constant throughout its extent, the 
new operator's surround (see Figure 4b) has a smaller extent and returns 
smoothly to zero from its peak negative value in its center. 

The difference between the two operators illustrates an interesting fea- 
ture of the learning algorithm: it adapts to its environment. The results 
imply that the optimal operator for images with strictly linear illumination 
gradients is one whose surround takes a constant average over a range smaller 
than the entire image. On the other hand, the surround of the optimal op- 
erator for images with smoothly varying illumination gradients is a low-pass 
filter that separates the illumination from the sharply-varying reflectance. 

Our learning procedure is motivated by our previous observation (Poggio 
and Hurlbert, 1984; see also Poggio et al., 1985b) that standard regulariza- 
tion algorithms in early vision define linear mappings between input and 
output and therefore can be learned associatively under certain conditions 
(see Appendix 3). Our algorithm synthesizes the optimal linear operator L 
that maps as closely as possible, in the least squares sense, the image irradi- 
ance into its reflectance and illumination components for this class of images 
and illumination gradients. The technique of optimal linear estimation that 
it uses is closely related to optimal Bayesian estimation (see Albert, 1972 
and Appendix 4). 
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Figure 5. In a separate experiment we trained the operator on a set of 1500 one- 
dimensional Mondrians with either linear gradients of illumination or sinusoidally 
varying illumination with random wavelengths between 0.5 and 1.5 times the length 
of the pattern, and random phases and amplitudes. We show here how the operator 
performs on a new input pattern with a linear illumination gradient (left) and on a 
pattern with a sinusoidal illumination (right). As in Figure 2, (a) shows the input 
data, (b) shows the corresponding output solution, in which the illumination and 
the reflectance have been separated and concatenated, (c) shows the result the 
trained operator yields when it acts on the input data (a), not part of the training 
set. It should be compared with (b). The filter corresponding to this operator is 
shown in the inset of Figure 4. 



The learned L acts on new input images to yield good approximations 
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to the correct output images. Note that the result of applying the matrix L 
to a two-dimensional Mondrian image (see Figure 3) is not as good as the 
one-dimensional examples suggest. The reason is that the matrix L has been 
learned from one-dimensional examples, and in the case of Figure 3 it has 
been applied to each row of the Mondrian independently. Small errors in 
offsets and scaling factors for each row that would only slightly affect the 
one-dimensional result become more obvious as they vary from row to row 
in the two-dimensional case. 

Training of the operator on two-dimensional examples is possible, but 
computationally very expensive if done in the same way. The present com- 
puter simulations require several hours when run on standard serial com- 
puters (up to 20 hours on a Symbolics 3640 for 512x512 images). The two- 
dimensional case will need much more time. We expect that it will be feasi- 
ble only on the latest version of the Connection Machine (the 65K-processor 
CM-2 with floating point hardware) of Thinking Machines Corporation. Our 
one-dimensional learning scheme runs orders of magnitude faster on a CM- 
1 Connection Machine System with 16K-processors. It is possible to use 
much more efficient methods of computing the pseudoinverse and especially 
approximations to it (see for example Albert, 1972 and Kohonen, 1977). 

The calculation of a regularized pseudoinverse may also be implemented 
by parallel networks of simple processors or by analog networks that bear 
some resemblance to biological systems. In particular, it could be computed 
by so-called "neural" networks using a gradient descent method (also called 
back-propagation in recent papers, see Rumelhart et al., 1986) and linear 
units. Since the pseudoinverse is the best linear approximation in the L2 
norm, gradient descent minimizing the square error between the actual out- 
put and desired output of a fully connected linear network is guaranteed to 
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converge, albeit slowly. Thus gradient descent in weight space would give the 
same result we obtain, the global minimum. In terms of a network of linear 
units, the training scheme we have run, using one-dimensional Mondrians 
512 pixels long, corresponds to learning the equivalent of up to about half a 
million weights. 

The significance of our result lies not so much in the specific estimation 
technique we used, but in the form of the filter we obtain. It is qualitatively 
the same as that which results from the direct application of regularization 
methods exploiting the spatial constraints on reflectance and illumination 
described above (see Table 1 in Poggio et al., 1985a; Poggio and staff, 1985b; 
Appendix 2). The Fourier transform of the filter of Figure 4 is approxi- 
mately a bandpass filter that cuts out low frequencies due to slow gradients 
of illumination and preserves intermediate frequencies due to step changes 
in reflectance. The large "inhibitory" surround also provides normalization 
to average grey in the field (see Hurlbert, 1986). 

We do not think that our results mean that color constancy may be 
learned during a critical period by biological organisms. It seems more rea- 
sonable to consider them simply as a demonstration on a toy world that 
evolution may recover and exploit natural constraints hidden in the physics 
of the world. The shape of the filter in Figure 4 is suggestive of the "non- 
classical" receptive fields that have been found in V4, a cortical area im- 
plicated in mechanisms underlying color constancy (Desimone et. al., 1985; 
Wild et al., 1985; Zeki, 1983a,b). 

Finally, it is important to stress that the general solution of the problem 
of color constancy requires much more work: real images are noisy, objects 
are three-dimensional, and there are shading, shadows, and specularities. We 
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Appendix 1 
The Intensity Equation 



This appendix illustrates the transformation of the photoreceptor activ- 
ity necessary in order to decompose the intensity equation into a sum of two 
components representing the surface reflectance and the source illumination 
intensity. 

In most natural scenes, the light reflected from objects includes both 
diffuse and specular components. To simplify the intensity equation, we 
assume that all reflection is Lambertian, or that there are no specularities. In 
this case, the light reflected by a surface is solely the product of its spectral 
reflectance and the illumination intensity that falls on it. The amount of 
reflected light that reaches the eye further depends on the angles between the 
illumination source, the reflecting surface, and the eye, and the response of 
the eye to the light depends on the spectral sensitivity of its photoreceptors. 
We may therefore write the intensity signal registered by the eye as: 

s'"(x) = log I a"(\)r'(\,x)e'(\,x)d\, (Al.l) 

where v labels the spectral type of the photoreceptor (u = 1, ...4 for humans, 
counting 3 cone types and 1 rod type), a"(A) is the spectral sensitivity of 
the i/th-type photoreceptor and s' v {x) its activity. r'(\,x) is the surface 
spectral reflectance and e'(X,x) the effective irradiance. We group together 
the geometric factors influencing the intensity signal by defining the effective 
irradiance as the ambient illumination modified by the orientation, shape, 
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and location of the reflecting surface. The effective irradiance is the intensity 
that the surface would reflect if it were white, that is, if r'(A, x) = 1. 

Equation A 1.1 is not separable into a product of reflectance and illumi- 
nation components in a single color channel. To make it separable, we must 
make a transformation to new color channels that are combinations of the 
photoreceptor activities. We first choose basis functions p l (X) and q'(X) such 
that for most naturally-occurring illuminants and surface reflectances 

c , (A,a:)«^p*(A)c' i (x) 

(A1.2) 
r'(\,x)^^q i (X)r' i (x) 

i 

The basis transformation (for a review of the origins of this idea see 
Maloney, 1985; see also Buchsbaum, 1980 and Yuille, 1984) leads to the 
following equation 

s'"(x) = T^ixy'^x), (A1.3), 

where the tensor T is defined as 



T^ = Jd\a»{\)p\\)q>{\), 



where v = 1, ...4 and i, j = 1, ... TV, the p's and the q's are the basis functions 
for the illuminant and for the albedo, respectively, and the sum is taken over 
repeated indices. 

To simplify further analysis, we impose the conditions that the p* = q { 
and that the p % {\) are orthogonal with respect to the a" (A). This orthogonal- 
ity is insured if, for example, the p % {\) do not overlap with respect to A. In the 
simplest case, the basis functions may be monochromatic: p'(\) = 6(X — A,). 
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Substituting Al.2 into Al.l then yields: 



s"(x) = log J2 e'^xy'^x) f dXa"(X)p i (X)p i (X) (A1A) 

i J 

If we define the matrix T? = / dXa v (X)p i (X)p i (\), where v = 1, ...4 
and i = 1, ...N, we obtain 

antilog s' v (x)^=^T l/i e\x)r' i {x) (41.5) 

i 

If the p'(A) are suitably chosen, T v { is invertible. Then the linear equa- 
tions represented in Al.5 yield the following solution: 



(T„j) antilog s v (x) = e \x)r \x) 



(41.6) 



or, 



s l (x) = e^xy'^x) (Al.l) 

where s' l (x) = (T^) -1 antilog s'"(x) . Taking logarithms of A1.7 yields 



log s % (x) = log e \x) + log r *(x) 



or 



s\x) = e\x) + r\x) i = l,...,7V, 



(41.8) 



where s'(x) = log s *(x) and so on, which is the desired equation. The 
extension to the two-dimensional case is clear. 

s*(x) is a linear combination of the activity of different types of photore- 
ceptors. It is important to note that the index i labels not the color channels 
associated with the spectral sensitivities of the different photoreceptor types 
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Appendix 2 
A Regularization Algorithm for Color Computation 



2.1 The regularization approach 

Consider the equation 

y = Az (^2.1) 

where A is a known operator. The direct problem is to determine y given z. 
The inverse problem is to find z, given y - the function y becomes the "data" 
and z the solution. Although the direct problem is usually well-posed, the 
inverse problem is often ill-posed. 

Standard regularization theories for "solving" ill-posed problems have 
been developed by Tikhonov (Tikhonov and Arsenin, 1977) and others. The 
basic idea underlying regularization techniques is to restrict the space of 
acceptable solutions by choosing the one solution that minimizes an appro- 
priate functional. Among the methods that can be employed (see Poggio, et. 
al. 1985a), the main one is to find z that minimizes 

\\Az-y\\ 2 + \\\Pzf. (A2.2) 

The choice of the norm || • ||, usually quadratic as in Equation A2.2, and of 
the linear stabilizing functional \\Pz\\, is dictated by mathematical considera- 
tions, and most importantly, by a physical analysis of the generic constraints 
on the problem. The regularization parameter X controls the compromise 
between the degree of regularization of the solution and its closeness to the 
data. 
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2.2 Regularizing the Color Computation 

Equation Al.8 is impossible to solve in the absence of additional con- 
straints: there are twice as many unknowns as equations for each x. Formally, 
the problem is ill-posed. Regularization techniques can be used to restrict 
the number of allowed solutions for e x (x) and r l (x), and thereby reduce the 
number of unknowns, by imposing constraints on the imaging process and 
the physical properties of the surfaces and the source illuminant. 

One constraint that may be used is spatial regularization (for other con- 
straints - spectral regularization and the single source assumption - see Hurl- 
bert, 2001 and Poggio, 1985b). The spatial regularization constraint formal- 
izes and extends the retinex assumptions that (a) r % (x) is either constant or 
changes sharply at boundaries between different materials, and (b) e l (x) is ei- 
ther constant or changes more smoothly than r l (x) across space. One retinex 
algorithm (Horn, 1974), for example, imposes the strong constraint (in two- 
dimensions) that all values of VV(a;,y) strictly below a fixed threshold are 
due to e\x,y). The regularization assumption requires only that e'(x) vary 
less sharply across space than r*(x) and effectively allows the limit on the 
spatial variation of e l (x) to be reset for each scene. 

Standard regularization techniques enforce this constraint on Equation 
Al.8 by requiring that its solution minimize the following variational princi- 
ple: 



£[ a « - (r» + e')] 2 + A[A e »]2 + fi[G * r i + *,..-] } (A2 _ 3) 



where G is a gaussian filter with standard deviation a. The first term re- 
quires that the solution (r* + e % ) be close to s'; the second term enforces the 
constraint that the illumination vary smoothly across space; and the third 
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term enforces the constraint that the reflectance vary not too smoothly across 
space. 

Minimizing Equation A2.3 demonstrates that the solutions r l and e' 
may be obtained by filtering s l through a linear filter. In the Fourier domain 
we derive the result: 

r {UJ) = (1 + {3e-"*° 2 + p-fu* + f3uj 2 e-" 2 * 2 + /3 7 o; 4 )(l + \u 2 ) - 1 * ^* 

(A2.4) 

Note that the quadratic variational principle of standard regularization 
cannot enforce the spatial regularization constraint with full generality. A 
more general regularization scheme based on Markov random fields, which 
leads to standard regularization as a special case, is sketched by Poggio and 
staff (1985). 
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Appendix 3 
Learning a Regularization Algorithm 



3.1 Associative Learning of Standard Regularizing Operators 

Minimization of the regularization principle A2.2 corresponds to deter- 
mining a regularizing operator that acts on the input data y and produces 
as an output the regularized solution z. Suppose now that instead of solv- 
ing for z, we are given y and its regularized solution z and we want to find 
the operator that effects the transformation between them. This appendix 
demonstrates that the regularizing operator can be synthesized by associa- 
tive learning from a set of examples. The argument consists of two claims, 
explored in detail below. The first claim is that the regularizing operator 
corresponding to a quadratic variational principle is linear. The second is 
that any linear mapping between two vector spaces can be synthesized by 
an associative scheme based on the computation of the pseudoinverse of the 
data. 

3.1.1 Linearity of the regularized solution 

To show that variational principles of the form of Equation A2.2 lead 
to a regularized solution that is a linear transformation of the data, we start 
with the discrete form of Equation A2.2: 

||^-y|| 2 + A||P^|| 2 , (A3.1) 

in which z and y are vectors and A and the Tikhonov stabilizer P are ma- 
trices, A does not depend on the data, and || • || is a norm. 

The minimum of this functional will occur at its unique stationary point 
z. To find z, we set to zero the gradient with respect to z of Equation 
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A3.1. The solution to the resulting Euler-Lagrange equations is the minimum 
vector z: 

(A T A + \P T P)z = A T y. (A3. 2) 

It follows that the solution z is a linear transformation of the data y: 

z = Ly, (A3.3) 

where L is the linear regularizing operator. (If the problem were well-posed, 
the operator L would equal simply the inverse of A.) It is important to note 
that L may depend on the given lattice of data points. 

3.1.2 Learning a linear mapping 

Given that the mapping between a set of input vectors y and their 
regularized solutions z is linear, how do we solve for it? We start by arranging 
the sets of vectors in two matrices Y and Z. The problem of synthesizing the 
regularizing operator L is then equivalent to "solving" the following equation 
for L: 

Z = LY (A3A) 

A general solution to this problem is given by 

L = ZY+, (A3.5) 

where Y + is the pseudoinverse of Y. This is the solution which is most robust 
against errors, if Equation A3.4 admits several solutions and it is the optimal 
solution in the least-squares sense, if no exact solution of Equation A3. 4 
exists. This latter case is the one of interest to us: in order to overconstrain 
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the problem, and so avoid look-up table solutions, we require that the number 
of examples (columns of Y) be larger than the rank of the matrix L. In this 
case, there is no exact solution of Equation A3. 4 and the matrix L is chosen 
instead to minimize the expression 

M=\\LY-Zf. (A3.6) 

L may be computed directly by minimizing A3. 6, which yields 

L = ZY T (YY T )~ 1 (AZ.7) 

In practice, we compute L using Equation A3. 7, but first regularize it 
by adding a stabilizing functional to obviate problems of numerical stability 
(Tikhonov and Arsenin, 1977). 

These results show that the standard regularizing operator L (parame- 
trized by the lattice of data points) can be synthesized without need of an 
explicit variational principle, if a sufficient set of correct input-output pairs is 
available to the system. Note that by supplying as examples the physically 
correct solutions z, we assume that they are identical to the regularized 
solutions z, and enforce both regularization and correctness on the linear 
operator we obtain. 

3.2 Recursive estimation of L 

It is of particular import for practical applications that the pseudoinverse 
can be computed in an adaptive way by updating it when new data become 
available (Albert, 1972). Consider again Equation A3.7. Assume that the 
matrix Y consists of n — 1 input vectors and Z of the corresponding correct 
outputs. We rewrite Equation A3. 7 as 
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L n _! = Z n _ 1 y+_ ! (43.8) 

If another input-output pair y n and z n becomes available, we can compute 
L n recursively by 



L n = L n -i + (z n - L n -iy n )t„, (A3.9) 

where 

*» l + yJC^-il^LO-iyn' (A3 ' 10) 

provided that (Yn^Y^)' 1 exists (i.e., that the number of columns in Y is 
greater than or equal to the dimension of y). The case in which (^n-i^^-i) -1 
does not exist is discussed together with more general results in Albert 
(1972). Note that (z n - X n _!j/ n ) in the updating Equation A3.9 is the error 
between the desired output and the predicted one, in terms of the current 
L. The coefficient t n is the weight of the correction: with the value given 
by Equation A3. 10 the correction is optimal and cannot be improved by any 
iteration without new data. A different value of the coefficient is suboptimal 
but may be used to converge to the optimal solution by successive iterations 
of Equation A3. 10 using the same data. 



26 



Appendix 4 
Optimal Linear Estimation, 

Regression and Bayesian Estimation 



The optimal linear estimation scheme we have described is closely related 
to a special case of Bayesian estimation in which the best linear unbiased 
estimator (BLUE) is found. Consider Equation A3. 3: the problem is to 
construct an operator L that provides the best estimation Ly of z. We assume 
that the vectors y and z are sample sequences of gaussian stochastic processes 
with, for simplicity, zero mean. Under these conditions the processes are fully 
specified by their correlation functions 

E[yy T ] = C yv , E[zy T ] = C zy (AAA) 

where E indicates the expected value. The BLUE of z (see Albert, 1972) is, 
given y, 

z est = C^y, (.44.2) 

which is to be compared with the regression equation 

Ly = ZY T (YY T )~ 1 y. (A4.3) 

The quantities ZY T and YY T are approximations to C zy and C yy , re- 
spectively, since the quantities are estimated over a finite number of observa- 
tions (the training examples). Thus there is a direct relation between BLUEs 
and optimal linear estimation. The learned operator captures the stochastic 
regularities of the input and output signals. Note that if the input vectors 
y are orthonormal, then L = ZY T and the problem reduces to constructing 
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a simple correlation memory of the holographic type (see Poggio, 1975a,b). 
Under no restrictions on the vectors y, the correlation matrix ZY T may 
still be considered as a low-order approximation to the optimal operator (see 
Kohonen, 1977). 
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